1. Introduction

Zipf's law [1, 2, 3], and power laws in general [4, 5, 6], have attracted, and continue to
attract, considerable attention in a wide variety of disciplines, from astronomy to
demographics to software structure to economics to zoology and even to warfare [7]. Typically
one is dealing with integer-valued observables (numbers of objects, people, cities, words,
animals, corpses), with $n \in \{1, 2, 3, \dots\}$. Sometimes the range of values is allowed
to be infinite (at least in principle), sometimes a hard upper bound $N$ is fixed (e.g.,
total population if one is interested in subdividing a fixed population into sub-classes).
Particularly interesting probability distributions are probability laws of the form:

• Zipf's law: $p_n \propto 1/n$.

• Power laws: $p_n \propto 1/n^z$.

• Hybrid geometric/power laws: $p_n \propto w^n/n^z$.

Specifically, a recent model of random group formation (RGF), see reference [8],
attempts a general explanation of such phenomena based on Jaynes' notion of maximum
entropy [9, 10, 11, 12, 13] applied to a particular choice of cost function [8]. (For recent
related work largely in a demographic context see [14, 15, 16, 17, 18]. For related work
in a fractal context implemented using an iterative framework see [19].) In the present
article I shall argue that the cost function used in the RGF model is unnecessarily
complicated (in fact RGF most typically leads to a hybrid geometric/power law, not a
pure power law), and that power laws can be obtained in a much simpler way by applying
maximum entropy ideas directly to the Shannon entropy [20, 21] subject only to a single
constraint: that the average of the logarithm of the observable quantity is specified.
Similarly, I would argue that (at least as long as the main issue one is interested in is
"merely" the minimum requirements for obtaining a power law) the appeal to a fractal
framework and the iterative model adopted by [19] is also unnecessarily complicated.

To place this observation in perspective, I will explore several variations on this 
theme, modifying both the relevant state space and the number of constraints, and will 
briefly discuss the relevant special functions of mathematical physics that one encounters 
(zeta functions, harmonic series, poly-logarithms). I shall also discuss an extremely
general Gibbs-like model, and the use of non-Shannon entropies (the Rényi [22] and
Tsallis [23] entropies and their generalizations). There is a very definite trade-off between
simplicity and generality, and I shall very much focus on keeping the discussion as 
technically simple as possible, and on identifying the simplest model with minimalist 
assumptions. 
2. Power laws in infinite state space

Let us define the set of observable quantities to be positive integers $n \in \{1, 2, 3, \dots\}$,
without any a priori upper bound. The maximum entropy approach [9, 10, 11, 12, 13]
seeks to estimate the probabilities $p_n$ by maximizing the Shannon entropy [20, 21]

$$ S = -\sum_n p_n \ln p_n, \tag{1} $$

subject to a (small) number of constraints/cost functions, representing our limited
state of knowledge regarding the underlying process. For example, the RGF model of
reference [8] uses one constraint and one (relatively complicated) cost function [8], in
addition to the "trivial" normalization constraint $\sum_n p_n = 1$. Instead, let us consider
the single constraint

$$ \langle \ln n \rangle \equiv \sum_{n=1}^{\infty} p_n \ln n = \chi. \tag{2} $$

Let us now maximize the Shannon entropy subject to this constraint. This is best done
by introducing a Lagrange multiplier $z$ corresponding to the constraint $\langle \ln n \rangle$, plus a
second Lagrange multiplier $\lambda$ corresponding to the "trivial" normalization constraint,
and considering the quantity:

$$ \hat S = -z \left( \sum_{n=1}^{\infty} p_n \ln n - \chi \right) - \lambda \left( \sum_{n=1}^{\infty} p_n - 1 \right) - \sum_{n=1}^{\infty} p_n \ln p_n. \tag{3} $$

Of course there is no loss of generality in redefining the Lagrange multiplier $\lambda$ as follows:

$$ \hat S = -z \left( \sum_{n=1}^{\infty} p_n \ln n - \chi \right) - (\ln Z - 1) \left( \sum_{n=1}^{\infty} p_n - 1 \right) - \sum_{n=1}^{\infty} p_n \ln p_n. \tag{4} $$

Varying with respect to the $p_n$ yields the extremality condition

$$ -z \ln n - \ln Z - \ln p_n = 0, \tag{5} $$

with explicit solution

$$ p_n = \frac{n^{-z}}{\zeta(z)}; \qquad Z = \zeta(z); \qquad z > 1. \tag{6} $$

Here $\zeta(z)$ is the Riemann zeta function [24, 25, 26, 27], a relatively common and
well-known special function, and the condition $z > 1$ is required to make the sum
$\sum_{n=1}^{\infty} n^{-z} = \zeta(z)$ converge. This is enough to tell you that one will never exactly
reproduce Zipf's law ($z = 1$) in this particular manner, though one can get arbitrarily
close. The value of the Lagrange multiplier $z$ (which becomes the exponent in the power
law) is determined self-consistently in terms of $\chi$ by demanding:

$$ \chi(z) = \langle \ln n \rangle = \frac{\sum_{n=1}^{\infty} n^{-z} \ln n}{\zeta(z)} = -\frac{d\zeta(z)/dz}{\zeta(z)} = -\frac{d \ln \zeta(z)}{dz}. \tag{7} $$

Then $z \in (1, \infty)$ while $\chi \in (0, \infty)$. For practical calculations it is often best to view
the exponent $z$ as the single free parameter and $\chi(z)$ as the derived quantity, but this
viewpoint can easily be inverted if desired. Near $z = 1$ we have the analytic estimate

$$ \chi(z) = \langle \ln n \rangle = \frac{1}{z-1} - \gamma + O(z-1), \tag{8} $$

where $\gamma$ denotes Euler's constant. At maximum entropy, summing over the extremality
condition, we have:

$$ \hat S(z) = S(z) = -\sum_{n=1}^{\infty} p_n \ln p_n = \ln \zeta(z) + z\, \chi(z). \tag{9} $$

Near $z = 1$ we have the analytic estimate

$$ \hat S(z) = S(z) = \frac{1}{z-1} + \ln\left( \frac{1}{z-1} \right) + 1 - \gamma + O(z-1). \tag{10} $$
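The quantities above are easy to check numerically. The following is a minimal sketch, not the author's code: the function names, the cutoff of 10,000 terms, and the Euler-Maclaurin tail correction used to accelerate the slowly converging sums are all assumptions made for illustration. It evaluates $\zeta(z)$, $\chi(z)$ and $S(z)$ by direct summation and compares them with the near-$z=1$ estimates.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant

def zeta_and_log_sum(z, cutoff=10_000):
    """Return (sum n^-z, sum n^-z ln n) over n >= 1, for real z > 1."""
    if z <= 1.0:
        raise ValueError("both sums converge only for z > 1")
    s0 = s1 = 0.0
    for n in range(1, cutoff):
        term = n ** -z
        s0 += term
        s1 += term * math.log(n)
    # Euler-Maclaurin tail for n >= cutoff: integral + half term + first correction
    m, log_m = float(cutoff), math.log(cutoff)
    s0 += m ** (1 - z) / (z - 1) + 0.5 * m ** -z + z * m ** (-z - 1) / 12
    s1 += (m ** (1 - z) * (log_m / (z - 1) + 1 / (z - 1) ** 2)
           + 0.5 * m ** -z * log_m
           + m ** (-z - 1) * (z * log_m - 1) / 12)
    return s0, s1

def chi(z):
    """Logarithmic average <ln n> for the pure power law p_n = n^-z / zeta(z)."""
    s0, s1 = zeta_and_log_sum(z)
    return s1 / s0

def entropy(z):
    """Shannon entropy ln zeta(z) + z chi(z) at the maximum-entropy solution."""
    s0, s1 = zeta_and_log_sum(z)
    return math.log(s0) + z * s1 / s0

for z in (1.01, 1.1, 2.0):
    eps = z - 1.0
    print(z, chi(z), 1 / eps - EULER_GAMMA,
          entropy(z), 1 / eps + math.log(1 / eps) + 1 - EULER_GAMMA)
```

The two estimates are only expected to agree well for $z$ close to 1; at $z = 2$ the printed columns illustrate how quickly the leading-order formulas degrade.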
For future reference, it is useful to observe that [27, see p 118]

$$ \zeta(z) = \frac{1}{z-1} + \sum_{m=0}^{\infty} \frac{(-1)^m \gamma_m}{m!} (z-1)^m, \tag{11} $$

where the Stieltjes constants $\gamma_m$ satisfy

$$ \sum_{n=1}^{N} \frac{(\ln n)^m}{n} = \frac{(\ln N)^{m+1}}{m+1} + \gamma_m + o(1); \qquad \gamma_0 = \gamma. \tag{12} $$

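A better estimate, using Euler-Maclaurin summation, is

$$ \sum_{n=1}^{N} \frac{(\ln n)^m}{n} = \frac{(\ln[N+\frac{1}{2}])^{m+1}}{m+1} + \gamma_m + O\left( \frac{(\ln[N+\frac{1}{2}])^m}{(N+\frac{1}{2})^2} \right). \tag{13} $$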
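As a quick sanity check of the improved estimate, the sketch below compares the direct sum with the shifted-logarithm formula for $m = 0$ and $m = 1$. The function names are hypothetical, and the numerical values of $\gamma_0$ and $\gamma_1$ are standard tabulated constants entered by hand rather than computed.

```python
import math

EULER_GAMMA = 0.5772156649015329    # gamma_0
STIELTJES_1 = -0.0728158454836767   # gamma_1

def log_power_sum(N, m):
    """Direct evaluation of sum_{n=1}^{N} (ln n)^m / n."""
    return sum(math.log(n) ** m / n for n in range(1, N + 1))

def midpoint_estimate(N, m, gamma_m):
    """Estimate with ln N replaced by ln(N + 1/2)."""
    return math.log(N + 0.5) ** (m + 1) / (m + 1) + gamma_m

def naive_estimate(N, m, gamma_m):
    """Leading-order estimate using ln N itself."""
    return math.log(N) ** (m + 1) / (m + 1) + gamma_m

N = 1000
for m, gamma_m in ((0, EULER_GAMMA), (1, STIELTJES_1)):
    exact = log_power_sum(N, m)
    print(m, exact - naive_estimate(N, m, gamma_m),
          exact - midpoint_estimate(N, m, gamma_m))
```

For each $m$ the second printed error (shifted logarithm) should be several orders of magnitude smaller than the first.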

The quick lesson to take is this: By applying maximum entropy considerations to
the single constraint $\langle \ln n \rangle = \chi$ one can get a pure power law with any exponent $z > 1$.
Furthermore, note that the quantity

$$ \exp\langle \ln n \rangle = \prod_{n=1}^{\infty} n^{p_n} \tag{14} $$

is the geometric mean of the integers $\{1, 2, 3, \dots\}$ with the exponents weighted by the
probabilities $p_n$. So one can just as easily obtain the pure power laws considered above
by maximizing the entropy subject to the constraint that this geometric mean takes on
a specified value.
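To illustrate the inverted viewpoint, in which the geometric mean is the input and the exponent $z$ is the output, here is a minimal sketch using bisection. It relies on the fact that $\langle \ln n \rangle$ decreases monotonically as $z$ increases (its derivative is minus the variance of $\ln n$). The function names, the bracketing interval and the cutoff are illustrative assumptions.

```python
import math

def mean_log(z, cutoff=2000):
    """<ln n> for p_n = n^-z / zeta(z), z > 1, with Euler-Maclaurin tails."""
    s0 = s1 = 0.0
    for n in range(1, cutoff):
        term = n ** -z
        s0 += term
        s1 += term * math.log(n)
    m, log_m = float(cutoff), math.log(cutoff)
    s0 += m ** (1 - z) / (z - 1) + 0.5 * m ** -z + z * m ** (-z - 1) / 12
    s1 += (m ** (1 - z) * (log_m / (z - 1) + (z - 1) ** -2)
           + 0.5 * m ** -z * log_m
           + m ** (-z - 1) * (z * log_m - 1) / 12)
    return s1 / s0

def exponent_from_geometric_mean(g, lo=1.0 + 1e-9, hi=60.0, steps=80):
    """Solve exp(<ln n>) = g for the power-law exponent z by bisection."""
    if g <= 1.0:
        raise ValueError("the geometric mean must exceed 1")
    target = math.log(g)
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if mean_log(mid) > target:
            lo = mid   # tail too heavy: the exponent must be larger
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Round trip: start from z = 2.5, form the geometric mean, recover z.
g = math.exp(mean_log(2.5))
print(g, exponent_from_geometric_mean(g))
```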
3. Power laws in finite state space

If one desires an exact Zipf law (exponent $z = 1$, so that $p_n \propto 1/n$), then because
the harmonic series diverges, $\sum_{n=1}^{\infty} 1/n = \infty$, it is clear that something in the above
formulation needs to change. Perhaps the easiest thing to do is to introduce an explicit
maximum value of $n$, call it $N$, so that we take the set of observables to be positive
integers $n \in \{1, 2, 3, \dots, N\}$. (Physicists would call this an infra-red cutoff, or
large-distance cutoff.) The maximum entropy approach now amounts to considering

$$ \hat S = -z \left( \sum_{n=1}^{N} p_n \ln n - \chi \right) - (\ln Z - 1) \left( \sum_{n=1}^{N} p_n - 1 \right) - \sum_{n=1}^{N} p_n \ln p_n. \tag{15} $$

Varying with respect to the $p_n$ and maximizing again yields the same extremality
condition

$$ -z \ln n - \ln Z - \ln p_n = 0, \tag{16} $$

but now implying

$$ p_n = \frac{n^{-z}}{H_N(z)}; \qquad Z = H_N(z). \tag{17} $$

Here $H_N(z)$ is the (reasonably well known) generalized harmonic function [27]

$$ H_N(z) = \sum_{n=1}^{N} \frac{1}{n^z}. \tag{18} $$

Compared with the previous case the only real difference lies in the normalization
function. However, because the sum is now always finite, there is no longer any
constraint on the value of the exponent $z$; in fact we can have $z \in (-\infty, \infty)$. The
case $z = 1$ is Zipf's law, while $z = 0$ is a uniform distribution, and $z < 0$ corresponds
to an "inverted hierarchy" where large values are more common than small values. The
price paid for this extra flexibility is that the model now has two free parameters,
which can be chosen to be $z$ and $N$. One has the self-consistency constraint

$$ \chi(z, N) = \langle \ln n \rangle = \frac{\sum_{n=1}^{N} n^{-z} \ln n}{H_N(z)} = -\frac{dH_N(z)/dz}{H_N(z)} = -\frac{d \ln H_N(z)}{dz}. \tag{19} $$

It is easy to check that $\chi$ is now bounded by $\chi \in (0, \ln N)$. At maximum entropy we
now have:

$$ \hat S(z, N) = S(z, N) = -\sum_{n=1}^{N} p_n \ln p_n = \ln H_N(z) + z\, \chi(z, N). \tag{20} $$

Because this is now a two-parameter model, it will always (naively) be a "better" fit
to observational data than a single-parameter model. Sometimes (for $z \leq 1$) retreating
to this 2-parameter model is necessary, but for $z > 1$ the one-parameter model of the
previous section ($N \to \infty$) should be preferred.
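The finite-state model is simple enough to tabulate directly. The sketch below (function names are hypothetical) builds $p_n = n^{-z}/H_N(z)$ for a few exponents, including the uniform case $z = 0$ and an "inverted hierarchy" with $z < 0$, and checks that the entropy equals $\ln H_N(z) + z\,\chi(z, N)$.

```python
import math

def finite_power_law(z, N):
    """Return (H_N(z), [p_1, ..., p_N]) for p_n = n^-z / H_N(z)."""
    weights = [n ** -z for n in range(1, N + 1)]
    H = sum(weights)  # generalized harmonic function H_N(z)
    return H, [w / H for w in weights]

def mean_log(p):
    """<ln n> for probabilities indexed from n = 1."""
    return sum(pn * math.log(n) for n, pn in enumerate(p, start=1))

def shannon_entropy(p):
    return -sum(pn * math.log(pn) for pn in p if pn > 0.0)

N = 50
for z in (-1.0, 0.0, 1.0, 1.5):
    H, p = finite_power_law(z, N)
    chi = mean_log(p)
    # last two columns should agree: direct entropy vs ln H_N(z) + z chi
    print(z, chi, shannon_entropy(p), math.log(H) + z * chi)
```

For every exponent the value of $\chi$ should fall strictly between 0 and $\ln N$, in line with the bound quoted above.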
4. Zipf's law in finite state space

If for observational or theoretical reasons one is certain that $z = 1$ (Zipf's law), then
the model reduces as follows: The state space is $n \in \{1, 2, 3, \dots, N\}$ where $N$ is now
the only free parameter. Then explicitly forcing $z \to 1$ one considers

$$ \hat S = -\left( \sum_{n=1}^{N} p_n \ln n - \chi \right) - (\ln Z - 1) \left( \sum_{n=1}^{N} p_n - 1 \right) - \sum_{n=1}^{N} p_n \ln p_n. \tag{21} $$

This is completely equivalent to considering

$$ \hat S = -(\ln Z - 1) \left( \sum_{n=1}^{N} p_n - 1 \right) - \sum_{n=1}^{N} p_n \ln(n\, p_n), \tag{22} $$

but writing the quantity to be maximized in this way hides the role of the Shannon
entropy. Varying with respect to the $p_n$ and maximizing now yields a (very) slightly
different extremality condition

$$ -\ln n - \ln Z - \ln p_n = 0, \tag{23} $$

and so

$$ p_n = \frac{1}{n\, H_N}; \qquad Z = H_N. \tag{24} $$



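Here $H_N$ is the (ordinary) harmonic number [27]

$$ H_N = \sum_{n=1}^{N} \frac{1}{n}. \tag{25} $$

Then

$$ \chi(N) = \langle \ln n \rangle = \frac{1}{H_N} \sum_{n=1}^{N} \frac{\ln n}{n}. \tag{26} $$

Furthermore, at maximum entropy

$$ \hat S(N) = S(N) = -\sum_{n=1}^{N} p_n \ln p_n = \ln H_N + \chi(N). \tag{27} $$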

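Now we have already seen

$$ H_N = \sum_{n=1}^{N} \frac{1}{n} = \ln(N + \tfrac{1}{2}) + \gamma + o(1), \tag{28} $$

and

$$ \sum_{n=1}^{N} \frac{\ln n}{n} = \tfrac{1}{2} \{\ln(N + \tfrac{1}{2})\}^2 + \gamma_1 + o(1). \tag{29} $$

Therefore

$$ \chi(N) = \langle \ln n \rangle = \tfrac{1}{2} \ln(N + \tfrac{1}{2}) + O(1), \tag{30} $$

and at maximum entropy

$$ \hat S(N) = S(N) = \tfrac{1}{2} \ln(N + \tfrac{1}{2}) + \ln\ln(N + \tfrac{1}{2}) + O(1). \tag{31} $$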
Note that one can use this to estimate the size $N$ of the state space one needs to adopt
in order to be compatible with the (observed) logarithmic average $\langle \ln n \rangle$. Indeed

$$ N \approx \exp\{2 \langle \ln n \rangle\} = \{\exp\langle \ln n \rangle\}^2. \tag{32} $$

This relates the size of the state space $N$ to the square of the geometric mean $\exp\langle \ln n \rangle$.
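A short numerical experiment shows how this estimate behaves in practice. The sketch below (hypothetical function names) computes $\langle \ln n \rangle$ exactly for the finite Zipf law and compares $\exp\{2\langle \ln n \rangle\}$ with the true $N$. Since the estimate is only a leading-order statement, the ratio should be expected to be of order one rather than exactly one.

```python
import math

def zipf_mean_log(N):
    """<ln n> = (1/H_N) sum_{n<=N} ln(n)/n for p_n = 1/(n H_N)."""
    H = sum(1.0 / n for n in range(1, N + 1))
    return sum(math.log(n) / n for n in range(1, N + 1)) / H

def estimate_state_space(chi):
    """Leading-order estimate N ~ exp(2 <ln n>)."""
    return math.exp(2.0 * chi)

for N in (10**3, 10**4, 10**5):
    chi = zipf_mean_log(N)
    # columns: N, exact <ln n>, (1/2) ln(N + 1/2), estimated N / true N
    print(N, chi, 0.5 * math.log(N + 0.5), estimate_state_space(chi) / N)
```

Used as an order-of-magnitude tool, the estimate is adequate; anyone needing a sharper value of $N$ could instead solve $\chi(N) = \langle \ln n \rangle$ numerically with `zipf_mean_log`.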



5. Hybrid geometric/power models in infinite state space 

To generate a hybrid geometric/power law model, similar in output to the RGF
model [8], but with considerably simpler input assumptions, simply take both the
logarithmic average $\langle \ln n \rangle = \chi$ and the arithmetic average $\langle n \rangle = \mu$ to be specified,
and then maximize the Shannon entropy subject to these two constraints (plus the
trivial normalization constraint). That is, introduce two Lagrange multipliers $z$ and $w$,
and maximize

$$ \hat S = -z \left( \sum_{n=1}^{\infty} p_n \ln n - \chi \right) + \ln w \left( \sum_{n=1}^{\infty} p_n\, n - \mu \right) - (\ln Z - 1) \left( \sum_{n=1}^{\infty} p_n - 1 \right) - \sum_{n=1}^{\infty} p_n \ln p_n. \tag{33} $$

Varying with respect to the $p_n$ yields

$$ -z \ln n + n \ln w - \ln Z - \ln p_n = 0, \tag{34} $$

with solution

$$ p_n = \frac{w^n\, n^{-z}}{\mathrm{Li}_z(w)}; \qquad Z = \mathrm{Li}_z(w); \qquad w < 1. \tag{35} $$

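Here the normalizing constant is the reasonably well-known poly-logarithm function
[24, 25, 26, 27]

$$ \mathrm{Li}_z(w) = \sum_{n=1}^{\infty} \frac{w^n}{n^z}; \qquad \mathrm{Li}_1(w) = -\ln(1 - w). \tag{36} $$

Then

$$ \chi(w, z) = \langle \ln n \rangle = \frac{1}{\mathrm{Li}_z(w)} \sum_{n=1}^{\infty} \frac{w^n \ln n}{n^z} = -\frac{\partial \ln \mathrm{Li}_z(w)}{\partial z}, \tag{37} $$

while

$$ \mu(w, z) = \langle n \rangle = \frac{1}{\mathrm{Li}_z(w)} \sum_{n=1}^{\infty} \frac{w^n\, n}{n^z} = \frac{\mathrm{Li}_{z-1}(w)}{\mathrm{Li}_z(w)} = \frac{\partial \ln \mathrm{Li}_z(w)}{\partial \ln w}. \tag{38} $$

Furthermore, at maximum entropy

$$ \hat S(w, z) = S(w, z) = -\sum_{n=1}^{\infty} p_n \ln p_n = \ln \mathrm{Li}_z(w) + z\, \chi(w, z) - \mu(w, z) \ln w. \tag{39} $$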
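The hybrid model can be explored with a few lines of code. The following is a sketch under stated assumptions: the poly-logarithm is evaluated by truncating its defining series once $w^n$ falls below a tolerance, which is adequate for $0 < w < 1$ and moderate $z$ but is not a general-purpose implementation, and all function names are hypothetical.

```python
import math

def hybrid_law(w, z, tol=1e-18):
    """Return (Li_z(w), [p_1, p_2, ...]) for p_n = w^n n^-z / Li_z(w)."""
    if not 0.0 < w < 1.0:
        raise ValueError("this sketch assumes 0 < w < 1")
    n_max = int(math.log(tol) / math.log(w)) + 1  # w^n < tol beyond this point
    weights = [w ** n * n ** -z for n in range(1, n_max + 1)]
    Z = sum(weights)  # truncated series for the poly-logarithm
    return Z, [x / Z for x in weights]

w, z = 0.9, 1.5
Z, p = hybrid_law(w, z)
chi = sum(pn * math.log(n) for n, pn in enumerate(p, start=1))
mu = sum(pn * n for n, pn in enumerate(p, start=1))
S = -sum(pn * math.log(pn) for pn in p)

# <n> should equal Li_{z-1}(w) / Li_z(w)
print(mu, hybrid_law(w, z - 1.0)[0] / Z)
# entropy should equal ln Li_z(w) + z chi - mu ln w
print(S, math.log(Z) + z * chi - mu * math.log(w))
```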

Note that the probability function arising in this model is fully as general as that arising
in the RGF model [8], but with a somewhat clearer interpretation.




This is because the RGF model uses an unnecessarily complicated "cost function",
with an unnecessary degeneracy in the parameters. Indeed, from reference [8] one sees
(in their notation, slightly different from current notation)

$$ I_{\mathrm{cost}} = \sum_k P(k) \ln(k\, N(k)); \qquad P(k) = N(k)/N. \tag{40} $$

That is

$$ I_{\mathrm{cost}} = \sum_k P(k) \ln[k\, N\, P(k)] \tag{41} $$

$$ \phantom{I_{\mathrm{cost}}} = \sum_k P(k) \ln k + \sum_k P(k) \ln N + \sum_k P(k) \ln P(k) \tag{42} $$

$$ \phantom{I_{\mathrm{cost}}} = \langle \ln k \rangle + \ln N - S. \tag{43} $$

So the RGF cost function [8] is simply a linear combination of Shannon entropy, the
logarithmic mean $\langle \ln k \rangle$, and a redundant constant offset $\ln N$. (Unfortunately the $N$ of
reference [8] is not the same as the $N$ used in this article; the RGF $N$ corresponds to the
number of independent realizations of the underlying statistical process one considers,
that is, the number of simulations, or number of universes in the statistical ensemble.)
One could in addition explicitly restrict the state space to be finite, adding yet another
free parameter ($M$ in the language of reference [8], $N$ in the language of this note), but
there is little purpose in doing so. The key insight is this: Hybrid geometric/power
laws drop out automatically and straightforwardly by maximizing the Shannon entropy
subject to the two very simple constraints $\langle \ln n \rangle = \chi$ and $\langle n \rangle = \mu$.
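The algebra behind the decomposition of the RGF cost function can be verified on arbitrary data. In the sketch below the group counts are made-up illustrative numbers and the function names are hypothetical; the only point is that the two expressions agree to rounding error.

```python
import math

def rgf_cost(counts):
    """Cost function sum_k P(k) ln(k N(k)), with counts mapping k -> N(k) > 0."""
    N = sum(counts.values())
    return sum((Nk / N) * math.log(k * Nk) for k, Nk in counts.items())

def decomposed_cost(counts):
    """Same quantity written as <ln k> + ln N - S."""
    N = sum(counts.values())
    P = {k: Nk / N for k, Nk in counts.items()}
    mean_log_k = sum(Pk * math.log(k) for k, Pk in P.items())
    S = -sum(Pk * math.log(Pk) for Pk in P.values())
    return mean_log_k + math.log(N) - S

counts = {1: 50, 2: 25, 3: 15, 7: 10}  # illustrative values only
print(rgf_cost(counts), decomposed_cost(counts))
```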



6. Zipf's law: geometric version in infinite state space 

If for observational or theoretical reasons one is certain that $z = 1$ (Zipf's law), but
for whatever reason feels a finite state space cutoff $N$ is inappropriate, then a geometric
version of Zipf's law can be extracted from the hybrid model. Setting $z = 1$ the model
reduces as follows: The state space is now $n \in \{1, 2, 3, \dots\}$ while

$$ \hat S = -\left( \sum_{n=1}^{\infty} p_n \ln n - \chi \right) + \ln w \left( \sum_{n=1}^{\infty} p_n\, n - \mu \right) - (\ln Z - 1) \left( \sum_{n=1}^{\infty} p_n - 1 \right) - \sum_{n=1}^{\infty} p_n \ln p_n. \tag{44} $$

This is completely equivalent to maximizing

$$ \hat S = \ln w \left( \sum_{n=1}^{\infty} p_n\, n - \mu \right) - (\ln Z - 1) \left( \sum_{n=1}^{\infty} p_n - 1 \right) - \sum_{n=1}^{\infty} p_n \ln(n\, p_n), \tag{45} $$

but writing the quantity to be maximized in this way hides the role of the Shannon
entropy. Varying with respect to the $p_n$ yields

$$ -\ln n + n \ln w - \ln Z - \ln p_n = 0, \tag{46} $$

with solution

$$ p_n = \frac{1}{|\ln(1 - w)|} \frac{w^n}{n}; \qquad Z = |\ln(1 - w)|; \qquad w \in (0, 1). \tag{47} $$

Note the normalizing function is now extremely simple: the natural logarithm. Then

$$ \chi(w) = \langle \ln n \rangle = \frac{1}{|\ln(1 - w)|} \sum_{n=1}^{\infty} \frac{w^n \ln n}{n}; \qquad w \in (0, 1), \tag{48} $$

while

$$ \mu(w) = \langle n \rangle = \frac{1}{|\ln(1 - w)|} \sum_{n=1}^{\infty} w^n = \frac{w}{(1 - w)\, |\ln(1 - w)|}. \tag{49} $$

Furthermore, at maximum entropy

$$ \hat S(w) = S(w) = -\sum_{n=1}^{\infty} p_n \ln p_n = \ln |\ln(1 - w)| + \chi(w) - \mu(w) \ln w. \tag{50} $$

This is a 1-parameter model with a geometrical cutoff, which for $w \lesssim 1$ (or more
precisely $w \to 1^-$) approximates the naive un-normalizable Zipf law with arbitrary
accuracy.
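The geometric version of Zipf's law is particularly easy to tabulate because its normalization is an elementary function. The sketch below (hypothetical names, series truncated once $w^n$ is negligible) checks the normalization and the closed form for $\langle n \rangle$, and shows how $n\,p_n/p_1$ stays close to 1 for small $n$ as $w \to 1^-$, which is the sense in which the law approaches $p_n \propto 1/n$.

```python
import math

def geometric_zipf(w, tol=1e-18):
    """Return [p_1, p_2, ...] for p_n = w^n / (n |ln(1 - w)|), truncated."""
    if not 0.0 < w < 1.0:
        raise ValueError("w must lie in (0, 1)")
    Z = abs(math.log(1.0 - w))
    n_max = int(math.log(tol) / math.log(w)) + 1
    return [w ** n / (n * Z) for n in range(1, n_max + 1)]

def mean_closed_form(w):
    """<n> = w / ((1 - w) |ln(1 - w)|)."""
    return w / ((1.0 - w) * abs(math.log(1.0 - w)))

for w in (0.9, 0.99, 0.999):
    p = geometric_zipf(w)
    mean = sum(n * pn for n, pn in enumerate(p, start=1))
    # columns: w, total probability, <n> by summation, <n> closed form, 10 p_10 / p_1
    print(w, sum(p), mean, mean_closed_form(w), 10 * p[9] / p[0])
```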

7. Very general Gibbs-like model 

Let us now consider an arbitrary number of constraints of the form

$$ \langle \ln g_i(n) \rangle = \sum_n p_n \ln g_i(n) = \chi_i; \qquad i \in (1, \#_g), \tag{51} $$

and

$$ \langle f_a(n) \rangle = \sum_n p_n f_a(n) = \mu_a; \qquad a \in (1, \#_f). \tag{52} $$

One could always transform a $g$ constraint into an $f$ constraint or vice versa, but as
we shall soon see there are advantages to keeping the logarithm explicit. Applying the
maximum entropy principle amounts to considering

$$ \hat S = -\sum_i z_i \left( \sum_n p_n \ln g_i(n) - \chi_i \right) - \sum_a \beta_a \left( \sum_n p_n f_a(n) - \mu_a \right) - (\ln Z - 1) \left( \sum_n p_n - 1 \right) - \sum_n p_n \ln p_n, \tag{53} $$

where with malice aforethought we have now relabeled the Lagrange multipliers for
the $f$ constraints as follows: $\ln w \to -\beta$. Then maximizing over the $p_n$ we have the
extremality condition

$$ -\sum_i z_i \ln g_i(n) - \sum_a \beta_a f_a(n) - \ln Z - \ln p_n = 0, \tag{54} $$

with explicit solution

$$ p_n = \frac{1}{Z} \left\{ \prod_i g_i(n)^{-z_i} \right\} \exp\left\{ -\sum_a \beta_a f_a(n) \right\}; \tag{55} $$

where now the normalizing constant is

$$ Z(z, \beta) = \sum_n \left\{ \prod_i g_i(n)^{-z_i} \right\} \exp\left\{ -\sum_a \beta_a f_a(n) \right\}. \tag{56} $$

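This can be viewed as a generalization/modification of the Gibbs distribution where we
have explicitly pulled out some of the constraints (the $g$ constraints) to make them look
power-law-like, while the remaining constraints (the $f$ constraints) are left in
Boltzmann-like form. Then

$$ \chi_i(z, \beta) = \langle \ln g_i(n) \rangle = \sum_n p_n \ln g_i(n) = -\frac{\partial \ln Z(z, \beta)}{\partial z_i}, \tag{57} $$

while

$$ \mu_a(z, \beta) = \langle f_a(n) \rangle = \sum_n p_n f_a(n) = -\frac{\partial \ln Z(z, \beta)}{\partial \beta_a}. \tag{58} $$

Furthermore, at maximum entropy

$$ \hat S(z, \beta) = S(z, \beta) = -\sum_n p_n \ln p_n = \ln Z(z, \beta) + \sum_i z_i\, \chi_i(z, \beta) + \sum_a \beta_a\, \mu_a(z, \beta). \tag{59} $$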

It is only once one specifies particular choices for the functions $f_a$ and $g_i$ that the model
becomes concrete, and only at that stage might one need to focus on particular special
functions of mathematical physics. The model is extremely general; the drawback is
that, because it can fit almost anything, it can "explain" almost anything, and so can
predict almost nothing.
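To make the general construction concrete, the sketch below picks one $g$ constraint and one $f$ constraint, both equal to $n$, on a finite state space, and confirms numerically that the averages are minus the derivatives of $\ln Z$ with respect to the corresponding multipliers. The state space, the constraint functions, the multiplier values and the finite-difference step are all illustrative assumptions, as are the function names.

```python
import math

def gibbs_model(states, z, beta, g_funcs, f_funcs):
    """Return (Z, p) with p_n = (1/Z) prod_i g_i(n)^-z_i exp(-sum_a beta_a f_a(n))."""
    log_w = [-sum(zi * math.log(g(n)) for zi, g in zip(z, g_funcs))
             - sum(ba * f(n) for ba, f in zip(beta, f_funcs))
             for n in states]
    Z = sum(math.exp(lw) for lw in log_w)
    return Z, [math.exp(lw) / Z for lw in log_w]

def expectation(states, p, func):
    return sum(pn * func(n) for n, pn in zip(states, p))

states = range(1, 201)        # finite state space, chosen for illustration
g_funcs = [lambda n: n]       # one power-law-like constraint, <ln n>
f_funcs = [lambda n: n]       # one Boltzmann-like constraint, <n>
z, beta = [1.2], [0.05]

Z, p = gibbs_model(states, z, beta, g_funcs, f_funcs)
chi_0 = expectation(states, p, math.log)
mu_0 = expectation(states, p, lambda n: n)
entropy = -sum(pn * math.log(pn) for pn in p)

def log_Z(z_val, beta_val):
    return math.log(gibbs_model(states, [z_val], [beta_val], g_funcs, f_funcs)[0])

h = 1e-6  # central finite-difference step
dlnZ_dz = (log_Z(z[0] + h, beta[0]) - log_Z(z[0] - h, beta[0])) / (2 * h)
dlnZ_dbeta = (log_Z(z[0], beta[0] + h) - log_Z(z[0], beta[0] - h)) / (2 * h)

print(chi_0, -dlnZ_dz)
print(mu_0, -dlnZ_dbeta)
print(entropy, math.log(Z) + z[0] * chi_0 + beta[0] * mu_0)
```

With these particular choices the model coincides with a truncated hybrid geometric/power law with $w = e^{-\beta}$.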






8. Non-Shannon entropies 

Shannon's entropy is by far the best motivated of the entropy functions infesting the
literature. Without making any particular commitment to the advisability of doing so,
we can certainly ask what happens if we apply maximum entropy ideas to non-Shannon
entropies (such as the Rényi [22] or Tsallis [23] entropies and their generalizations). Let
us define an entropic zeta function by

$$ \zeta_S(s) = \sum_n (p_n)^s, \tag{60} $$

which certainly converges for $s > 1$ and may converge on a wider region. Then

$$ S_{\text{Rényi}}(1 + a) = -\frac{\ln \zeta_S(1 + a)}{a}; \qquad S_{\text{Tsallis}}(1 + a) = \frac{1 - \zeta_S(1 + a)}{a}, \tag{61} $$

where in both cases the Shannon entropy is recovered in the limit $a \to 0$. More generally
let us consider a generalized entropy of the form

$$ S(a) = -f(\zeta_S(1 + a)), \tag{62} $$

for an arbitrary smooth function $f(\cdot)$. Let us further impose a constraint on the $b$-th
moment

$$ \langle n^b \rangle = \sum_n p_n\, n^b = \mu_b. \tag{63} $$
With power laws being explicitly built in as input into both the generalized entropy and
the constraint, it is perhaps not too surprising that we will manage to get power laws
dropping out. One is now interested in maximizing

$$ \hat S = \lambda \left( \sum_n p_n\, n^b - \mu_b \right) + w \left( \sum_n p_n - 1 \right) - f(\zeta_S(1 + a)). \tag{64} $$

Varying the $p_n$ leads to

$$ \lambda n^b + w - f'(\zeta_S(1 + a))\, (1 + a)\, (p_n)^a = 0, \tag{65} $$

with solution

$$ p_n = \frac{(w + \lambda n^b)^{1/a}}{Z}; \qquad Z = \sum_n (w + \lambda n^b)^{1/a}. \tag{66} $$

(An overall factor of $(1 + a)\, f'(\zeta_S(1 + a))$ simply drops out of the calculation.) If the
number of states is finite, then we cannot a priori discard the parameter $w$, and we have
derived a distorted power law. (The probability distribution then interpolates between
a pure power law for $w = 0$ and a uniform distribution for $w = \infty$.) If on the other
hand, the number of states is infinite then normalizability enforces $w \to 0$ and $\lambda$ factors
out. We then have a pure power law

$$ p_n = \frac{n^{b/a}}{Z}; \qquad Z = \sum_n n^{b/a}. \tag{67} $$

If the state space is the positive integers $n \in \{1, 2, 3, \dots\}$ then $Z \to \zeta(-b/a)$, the
Riemann zeta function, and the sum converges only for $-b/a > 1$. So one of the two
parameters $(a, b)$ must be negative. In this situation

$$ \mu_b = \langle n^b \rangle = \frac{\zeta(-b - b/a)}{\zeta(-b/a)}, \tag{68} $$

now requiring both $-b - b/a > 1$ and $-b/a > 1$, while at maximum entropy

$$ \hat S = S \to -f(\zeta_S(1 + a)) = -f\left( \frac{\zeta(-(1 + a) b/a)}{\zeta(-b/a)^{1 + a}} \right). \tag{69} $$

So yes, one can also extract power laws from maximum entropy applied to non-Shannon
entropies (in particular, the generalized Rényi-Tsallis entropies), but the process is (at
best) rather clumsy, and seems an exercise in overkill. Apart from the whole question
of whether or not non-Shannon entropies are particularly interesting, one should ask
whether the result is particularly useful. This derivation does not seem to be in any
way an improvement over the simpler one based directly on the Shannon entropy, so its
utility is dubious.
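The statement that both non-Shannon entropies reduce to the Shannon entropy as $a \to 0$ can be checked directly from the entropic zeta function. The sketch below uses a finite Zipf distribution with $N = 100$ purely as a convenient test case; the function names are hypothetical.

```python
import math

def entropic_zeta(p, s):
    """Entropic zeta function: sum over states of p_n ** s."""
    return sum(pn ** s for pn in p)

def renyi_entropy(p, a):
    return -math.log(entropic_zeta(p, 1.0 + a)) / a

def tsallis_entropy(p, a):
    return (1.0 - entropic_zeta(p, 1.0 + a)) / a

def shannon_entropy(p):
    return -sum(pn * math.log(pn) for pn in p)

# Test distribution: Zipf's law on {1, ..., N}.
N = 100
H = sum(1.0 / n for n in range(1, N + 1))
p = [1.0 / (n * H) for n in range(1, N + 1)]

for a in (1e-1, 1e-3, 1e-5):
    print(a, renyi_entropy(p, a), tsallis_entropy(p, a), shannon_entropy(p))
```

For a uniform distribution the Rényi entropy equals $\ln N$ for every $a$, which provides a second simple check.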



9. Summary and Discussion 

The main point of this article is that power laws (and their variants, including hybrid
geometric/power laws) have a very natural and straightforward interpretation in terms
of the maximum entropy formalism pioneered by Jaynes. The key to obtaining a pure
power law in the simplest possible manner lies in maximizing the Shannon entropy
while imposing the simple constraint $\langle \ln n \rangle = \chi$. Depending on other features of the
specific model under consideration, detailed analysis leads to certain of the special
functions of mathematics (the Riemann zeta function, generalized harmonic functions,
poly-logarithms, or even ordinary logarithms), but these are relatively well-known
mathematical objects, which are still tolerably simple to deal with.

Adding additional features (finite size state space, extra constraints) can (and 
typically will) modify both the functional form and the normalization constants 
appearing in the probability distribution. As always there is a trade-off between 
simplicity and flexibility. A more complicated model (with more free parameters) has a 
more flexible probability distribution, but this comes at a real (if often unacknowledged) 
cost in terms of internal complexity. A rather general Gibbs-like model is laid out and 
briefly discussed. We also briefly discuss applying maximum entropy ideas to non-Shannon
entropies (such as the Rényi and Tsallis entropies). There is very definitely a
trade-off in both elegance and plausibility, and I would argue strongly that the simplest
and most elegant model consists of the Shannon entropy, a constraint on $\langle \ln n \rangle = \chi$,
and a trivial normalization constraint on the sum of probabilities.

The fact that the logarithmic average $\langle \ln n \rangle$ plays such an important role in power
laws seems to have a connection with the fact that logarithmic scales are ubiquitous in 
classifying various natural and social phenomena. For instance: 

• Stellar magnitudes are logarithmic in stellar luminosity. 

• Earthquake magnitudes (modified Richter scale) are logarithmic in energy release. 

• Sound intensity decibels are logarithmic in pressure. 

• The acidity/alkalinity pH scale is logarithmic in hydrogen ion concentration. 

• Musical octaves are logarithmic in frequency. 

• War severity can be characterized as being logarithmic in casualty count [7]. 

In many cases the utility of a logarithmic scale can be traced back to an approximate 
logarithmic sensitivity in human perceptual systems, but it is very easy to confound 
cause and effect. After all, in the presence of power-law distributed external stimuli, 
there is a significant disadvantage in having the human perceptual system overwhelmed 
by large numbers of low-impact events, suggesting an evolutionary pressure towards 
suppressing sensitivity to low-impact events. This suggests that logarithmic sensitivity 
in human (and animal) perceptual systems is evolutionarily preferred for those senses 
that are subject to an external bath of power-law distributed stimuli. 

Fortunately, for the purposes of applying maximum entropy one does not need to
know which is the cause and which is the effect; one is "merely" using Bayesian
principles to estimate underlying probabilities in the presence of limited knowledge. For
current purposes this is most typically the single piece of information that $\langle \ln n \rangle = \chi$.